Let’s assume you are an HR manager (with a soft spot for statistics) trying to find out how best to analyse data from a survey about employee satisfaction.

The consultant you hired came back with a bunch of colorful charts and employee satisfaction numbers that were precise to 3 digits, but you couldn’t stop thinking of your professor’s rambling that firms would learn more about their employees if they used the data better and did some multilevel analysis.

You weren’t entirely convinced, because you found his argument too abstract, but you were also a decent hacker and thought: now is the time to just simulate some data and see if one really does better with a multilevel analysis.

Here is what you know about your firm:

  • It has 15 departments: five small (around 5 employees), five medium (around 22), and five large (around 98).
  • The firm-wide average satisfaction is around 3, and departments vary around this average with a standard deviation of about 0.25.

Here is this knowledge expressed as variables.

n_departments = 15
set.seed(123)
# department sizes: 5 small, 5 medium and 5 large departments,
# plus a little binomial noise
n_employees = 
  rep(c(5,22,98), each = 5) + 
  rbinom(n_departments, 5, .5)

# distribution of department-level average satisfaction
mean_firm_satisfaction = 3
sd_firm_satisfaction = .25

Here is a simple model of the data generating process, and its translation into simulation code:

Model: \(\mu_d \sim normal(3, .25)\)

Simulation code:

department_mus = 
  rnorm(n_departments,
        3, .25)

In words: The average satisfaction in departments has a normal distribution with mean_firm_satisfaction = 3 and sd_firm_satisfaction = 0.25.

Model: \(S_i \sim normal(\mu_d, 1)\)

Simulation code:

employee_satisfaction = 
  rnorm(n_employees[d],
        department_mus[d],
        1)

In words: The satisfaction of individual employees varies with sd = 1 around the average satisfaction of their department.


Now we can simulate the satisfaction of employees in the firm:

set.seed(1)
department_mus = 
    rnorm(n_departments,
          mean_firm_satisfaction,
          sd_firm_satisfaction)
firm_satisfaction = c()
for (d in 1:n_departments) {
  employee_satisfaction = 
    rnorm(n_employees[d],
          department_mus[d],
          1)
  firm_satisfaction = rbind(
    firm_satisfaction,
    data.frame(
      department = as.integer(d),
      satisfaction = employee_satisfaction
    )
  )
}

Here is a brief summary of the data:

dep.satisf = 
  describeBy(firm_satisfaction$satisfaction,
           group = firm_satisfaction$department,
           mat = TRUE, skew = FALSE) %>% 
  .[,-1] 

dep.satisf %>% 
  kable(digits = 1, row.names = FALSE) %>% 
  kable_styling(full_width = FALSE)
group1 vars n mean sd min max range se
1 1 7 3.4 0.4 2.8 3.8 1.0 0.2
2 1 8 2.7 0.9 1.1 3.7 2.6 0.3
3 1 7 2.7 0.8 1.4 4.1 2.7 0.3
4 1 9 3.5 0.7 2.7 4.5 1.8 0.2
5 1 9 3.3 0.8 2.0 4.5 2.6 0.3
6 1 23 2.9 1.1 1.0 5.2 4.2 0.2
7 1 25 3.1 0.9 1.6 4.7 3.1 0.2
8 1 26 3.3 0.8 2.5 5.0 2.4 0.1
9 1 25 2.7 0.9 1.2 5.2 4.0 0.2
10 1 24 3.2 1.1 1.4 5.2 3.8 0.2
11 1 102 3.4 1.0 0.5 6.0 5.5 0.1
12 1 100 3.2 1.0 0.5 5.1 4.6 0.1
13 1 101 2.8 1.1 -0.2 5.0 5.2 0.1
14 1 101 2.4 1.1 -0.1 6.3 6.3 0.1
15 1 99 3.1 1.1 0.3 6.0 5.6 0.1

A standard way to visualize these results is a bar chart of group means, together with error bars. The figure also shows the true department-level satisfaction, which we usually cannot observe, as green stars, and the overall firm mean as a dotted red line.

plot_data = function(return.x = FALSE) {
  set_par()
  se = dep.satisf$se
  x = barplot(dep.satisf$mean,
              names.arg = dep.satisf$group1,
              xlab = "department",
              ylab = "satisfaction",
              ylim = c(0, max(dep.satisf$mean+1.96*se)))
  
  arrows(x0 = x,
         y0 = dep.satisf$mean - 1.96*se,
         y1 = dep.satisf$mean + 1.96*se,
         length = 0)
  abline(h = mean(firm_satisfaction$satisfaction), col = "red", lty = 3)
  points(x,department_mus, pch = 8, col = "green3")
  
  if (return.x == TRUE)
    return(x)
}

plot_data()

In our simulated world, the true and the measured satisfaction differ due to random measurement error, as indicated by this line in the simulation code: employee_satisfaction = rnorm(n_employees[d], department_mus[d], 1)
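To get a feeling for how large this measurement error is, here is a small base-R sketch (independent of the simulation above): the sampling error of a department mean shrinks with department size, since the sd of a sample mean is the individual-level sd divided by \(\sqrt{n}\).

```r
# with an individual-level sd of 1, a 5-person department mean
# is noisy, a 100-person department mean much less so
set.seed(2)
small = replicate(1e4, mean(rnorm(5, 3, 1)))
large = replicate(1e4, mean(rnorm(100, 3, 1)))
round(c(sd_small = sd(small), sd_large = sd(large)), 2)
```

With only 5 employees, the observed department mean easily deviates from the true mean by half a point; with around 100 employees, deviations beyond 0.2 are rare. This is why the small departments in the figure scatter more around their true values.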

3 levels of pooling

How we want to analyse the data depends partially on our assumptions about the data generating process. In the context of multilevel modeling, we can think of three qualitatively different assumptions:

  • Full pooling: all departments share the same average satisfaction.
  • No pooling: the departments’ average satisfactions are completely independent of each other.
  • Partial pooling: the departments’ averages differ, but are drawn from a common distribution.

Full pooling analysis

A Bayesian model to estimate identical group means (full pooling) looks as follows:

\[ S_i \sim normal(\bar \mu, \sigma) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1) \\ \]

where \(\bar \mu\) is the average satisfaction in the firm.

However, to facilitate the comparison to alternative models, we re-write the model as follows:

Full pooling
\[\begin{align*} S_i \sim normal(\mu_d, \sigma) & \;\;\;\;{\small \textrm{Individual satisfaction is department satisfaction plus error}} \\ \mu_d \sim normal(\bar \mu,.0001) & \;\;\;\;{\small \textrm{Department satisfaction is equal to the average firm satisfaction}} \\ \bar \mu \sim normal(3,3) & \;\;\;\;{\small \textrm{Prior for the average firm satisfaction}} \\ \sigma \sim exponential(1) & \;\;\;\;{\small \textrm{Prior for the error standard deviation}} \\ \end{align*}\]

\(\mu_d\) are the department-level satisfactions, which depend on the firm-level satisfaction \(\bar \mu\) and the variation between departments. By setting the standard deviation of the variation between departments to 0.0001, we enforce that the department-level means are (practically) equal.

Here are the corresponding ulam models:

m.full_pooling = alist(
  S ~ dnorm(mu_bar, sigma),
  mu_bar ~ dnorm(3,3),
  sigma ~ dexp(1)
)
m.full_pooling.b = alist(
  S ~ dnorm(mu, sigma),
  mu <- mu_bar + z[d]*.0001,
  z[d] ~ dnorm(0,1),
  mu_bar ~ dnorm(3,3),
  sigma ~ dexp(1),
  # calculate mu per department
  # in generated quantities
  gq> vector[d]:mus<<-mu_bar+z*.0001
)

NOTE The second model implements what is called a “non-centered parameterization” of the model. Here are the two formulations side by side:

Centered:

mus[d] ~ dnorm(mu_bar, Sigma)

Non-centered:

mus = mu_bar + z*Sigma
z[d] ~ dnorm(0,1)
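To convince ourselves that the two parameterizations describe the same distribution, here is a quick base-R check (not part of the ulam models): drawing the means directly and constructing them from standardized z-scores yield the same distribution.

```r
set.seed(42)
mu_bar = 3; Sigma = .25; n = 1e5

# centered: draw the department means directly
mus_c = rnorm(n, mu_bar, Sigma)

# non-centered: draw standard normal z, then shift and scale
z = rnorm(n, 0, 1)
mus_nc = mu_bar + z * Sigma

round(c(mean(mus_c), mean(mus_nc)), 2)  # both ~3
round(c(sd(mus_c), sd(mus_nc)), 2)      # both ~0.25
```

The non-centered version is often easier for the sampler, because the sampled parameters z are independent of the scale parameter Sigma.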

Before we fit the model, we put the data into a list:

u.data = list(
  S = firm_satisfaction$satisfaction,
  d = as.integer(firm_satisfaction$department)
)

Now we estimate the models …

n.cores = ifelse(Sys.info()["sysname"] == "Darwin",4,1)
fit.full_pooling = ulam(
  m.full_pooling,
  data = u.data,
  log_lik = TRUE, iter = 1000,
  chains = 4, cores = n.cores)
fit.full_pooling.b = ulam(
  m.full_pooling.b,
  data = u.data,
  log_lik = TRUE, iter = 1000,
  chains = 4, cores = n.cores)

and check convergence:

precis(fit.full_pooling) %>% 
  round(2)
##        mean   sd 5.5% 94.5%   n_eff Rhat4
## mu_bar 3.00 0.04 2.93  3.07 1475.88     1
## sigma  1.06 0.03 1.02  1.11 1448.46     1
precis(fit.full_pooling.b,
       depth = 2) %>% 
  round(2)
##          mean   sd  5.5% 94.5%   n_eff Rhat4
## z[1]     0.00 1.01 -1.57  1.65 2754.74     1
## z[2]     0.00 1.04 -1.66  1.64 2563.12     1
## z[3]     0.02 0.98 -1.52  1.63 2240.41     1
## z[4]     0.02 0.94 -1.52  1.47 2573.14     1
## z[5]    -0.01 0.99 -1.61  1.54 2618.93     1
## z[6]    -0.02 0.98 -1.61  1.57 2387.78     1
## z[7]     0.02 0.98 -1.54  1.59 1947.16     1
## z[8]     0.00 0.98 -1.58  1.54 1970.63     1
## z[9]     0.00 0.98 -1.54  1.56 2560.57     1
## z[10]   -0.02 0.98 -1.59  1.63 2683.93     1
## z[11]    0.03 1.00 -1.51  1.62 1947.40     1
## z[12]    0.02 0.97 -1.52  1.61 2407.30     1
## z[13]   -0.02 1.01 -1.60  1.53 2050.92     1
## z[14]   -0.01 1.00 -1.64  1.53 2152.81     1
## z[15]    0.01 1.05 -1.65  1.68 2519.87     1
## mu_bar   3.00 0.04  2.94  3.06 2293.38     1
## sigma    1.06 0.03  1.01  1.11 3189.93     1
## mus[1]   3.00 0.04  2.94  3.06 2294.97     1
## mus[2]   3.00 0.04  2.94  3.06 2294.36     1
## mus[3]   3.00 0.04  2.94  3.06 2293.57     1
## mus[4]   3.00 0.04  2.94  3.06 2293.68     1
## mus[5]   3.00 0.04  2.94  3.06 2293.83     1
## mus[6]   3.00 0.04  2.94  3.06 2293.44     1
## mus[7]   3.00 0.04  2.94  3.06 2292.56     1
## mus[8]   3.00 0.04  2.94  3.06 2293.30     1
## mus[9]   3.00 0.04  2.94  3.06 2293.02     1
## mus[10]  3.00 0.04  2.94  3.06 2293.27     1
## mus[11]  3.00 0.04  2.94  3.06 2294.59     1
## mus[12]  3.00 0.04  2.94  3.06 2293.82     1
## mus[13]  3.00 0.04  2.94  3.06 2291.77     1
## mus[14]  3.00 0.04  2.94  3.06 2293.53     1
## mus[15]  3.00 0.04  2.94  3.06 2293.14     1

Do the two models really describe the data equally well, and is the number of effective parameters more or less equal?

compare(fit.full_pooling,
        fit.full_pooling.b) %>% 
  round(2)
##                       WAIC    SE dWAIC  dSE pWAIC weight
## fit.full_pooling.b 1969.22 36.14  0.00   NA  1.90   0.52
## fit.full_pooling   1969.39 36.13  0.17 0.06  1.99   0.48

Indeed, the difference between the two models is very small. More importantly, we can see that the effective numbers of parameters (pWAIC) of the two versions of the full pooling model are 1.99 and 1.90, even though the models have 2 and 17 actual parameters, respectively.

And here is our initial figure with the estimated average employee satisfactions:

x = plot_data(return.x = TRUE)
est.full_pooling = precis(fit.full_pooling.b, pars = "mus", depth = 2)
points(x,est.full_pooling$mean, col = "blue", pch = 16)

As expected, the model estimates the same mean for every department.

No pooling analysis

A Bayesian model to estimate independent group means looks as follows:

Full pooling:

\[ S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,.0001) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\]

No pooling:

\[S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,\color{red}{1000}) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\]

In this model, the large standard deviation of the distribution of department level satisfaction (\(\small \mu_d \sim normal(\bar \mu,1000)\)) encodes the assumption of independent groups.

Getting from a full pooling to a no pooling model by changing a single model parameter shows that these are not qualitatively distinct models, but opposite endpoints of a continuous model space.
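The effect of these two extreme standard deviations can be seen directly by drawing department means from the two priors (a base-R sketch with mu_bar fixed at 3 for illustration):

```r
set.seed(1)
mu_bar = 3

# full pooling: sd = .0001 pins all department means to mu_bar
round(rnorm(5, mu_bar, .0001), 3)

# no pooling: sd = 1000 leaves the department means
# practically unconstrained by mu_bar
round(rnorm(5, mu_bar, 1000), 1)
```

With sd = .0001 every draw is indistinguishable from 3; with sd = 1000 the draws carry essentially no information from mu_bar, so each department mean is estimated from its own data alone.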

The corresponding ulam model looks as follows:

m.no_pooling = alist(
  S ~ dnorm(mu, sigma),
  mu <- mu_bar + z[d]*1000,
  z[d] ~ dnorm(0,1),
  mu_bar ~ dnorm(3,3),
  sigma ~ dexp(1),
  gq> vector[d]:mus<<-mu_bar+z*1000
)

Now we estimate the model …

fit.no_pooling = ulam(
  m.no_pooling,
  data = u.data,
  log_lik = TRUE, iter = 1000,
  chains = 4
)

and check convergence:

precis(fit.no_pooling,depth = 2) %>% 
  round(2)
##         mean   sd  5.5% 94.5%   n_eff Rhat4
## z[1]    0.00 0.00  0.00  0.01  152.41  1.03
## z[2]    0.00 0.00 -0.01  0.00  151.49  1.03
## z[3]    0.00 0.00 -0.01  0.00  150.10  1.03
## z[4]    0.00 0.00  0.00  0.01  150.37  1.03
## z[5]    0.00 0.00  0.00  0.01  151.15  1.03
## z[6]    0.00 0.00  0.00  0.00  148.87  1.03
## z[7]    0.00 0.00  0.00  0.00  149.98  1.03
## z[8]    0.00 0.00  0.00  0.01  149.66  1.03
## z[9]    0.00 0.00 -0.01  0.00  149.63  1.03
## z[10]   0.00 0.00  0.00  0.00  149.65  1.03
## z[11]   0.00 0.00  0.00  0.01  148.70  1.03
## z[12]   0.00 0.00  0.00  0.00  148.26  1.03
## z[13]   0.00 0.00 -0.01  0.00  149.20  1.03
## z[14]   0.00 0.00 -0.01  0.00  149.02  1.03
## z[15]   0.00 0.00  0.00  0.00  147.86  1.03
## mu_bar  2.90 3.03 -1.74  7.85  148.50  1.03
## sigma   1.02 0.03  0.98  1.06  281.88  1.01
## mus[1]  3.40 0.38  2.79  4.02 1873.91  1.00
## mus[2]  2.66 0.35  2.11  3.21 2177.09  1.00
## mus[3]  2.71 0.37  2.09  3.29 1886.23  1.00
## mus[4]  3.54 0.33  3.04  4.08 2353.43  1.00
## mus[5]  3.34 0.34  2.77  3.87 2378.79  1.00
## mus[6]  2.94 0.21  2.61  3.28 2801.84  1.00
## mus[7]  3.14 0.21  2.81  3.47 2532.67  1.00
## mus[8]  3.34 0.21  3.01  3.67 2707.75  1.00
## mus[9]  2.70 0.21  2.37  3.03 2495.72  1.00
## mus[10] 3.15 0.21  2.83  3.49 2338.38  1.00
## mus[11] 3.37 0.10  3.21  3.53 1816.83  1.00
## mus[12] 3.16 0.10  3.00  3.32 2313.76  1.00
## mus[13] 2.81 0.10  2.65  2.97 2166.62  1.00
## mus[14] 2.40 0.10  2.24  2.57 2077.21  1.00
## mus[15] 3.14 0.10  2.98  3.31 1960.21  1.00

Now let’s look at the true and estimated satisfaction in departments, where the estimated satisfaction from the no-pooling model is shown as blue dots:

x = plot_data(return.x = TRUE)
est.no_pooling = precis(fit.no_pooling, pars = "mus", depth = 2)
points(x,est.no_pooling$mean, col = "blue", pch = 16)

As expected, the estimates of this simple, no-pooling, Bayesian model correspond to the averages we calculated earlier, except that we see a bit of shrinkage towards the prior mean for the small departments (1-5).

Partial pooling analysis

In the full and no pooling analyses, we determined the degree of pooling by setting the standard deviation of the department-level satisfaction to a very low or a very high number, respectively.

The key advantage of partial pooling models is that instead of setting this standard deviation to a fixed value, they estimate it as a parameter from the data. This allows such models to adapt to situations where the different groups are either similar (low standard deviation) or dissimilar (high standard deviation).
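Some intuition for how partial pooling adapts: for known \(\sigma\) and \(\Sigma\), the partial pooling estimate of a department mean is approximately a precision-weighted average of the department’s sample mean and the grand mean. Here is a base-R sketch of this approximation (the function name and example values are illustrative, not taken from the fitted models):

```r
# approximate partial pooling estimate for one department:
# weight the sample mean by its precision n_d / sigma^2 and
# the grand mean by the between-department precision 1 / Sigma^2
shrink = function(dep_mean, n_d, grand_mean,
                  sigma = 1,     # within-department sd
                  Sigma = .25) { # between-department sd
  w = (n_d / sigma^2) / (n_d / sigma^2 + 1 / Sigma^2)
  w * dep_mean + (1 - w) * grand_mean
}

# a small department is shrunk strongly towards the grand mean,
# a large one hardly at all
round(shrink(3.5, n_d = 5,   grand_mean = 3), 2)  # ~3.12
round(shrink(3.5, n_d = 100, grand_mean = 3), 2)  # ~3.43
```

When the data suggest a small \(\Sigma\) (similar groups), the weight on the grand mean grows and the model behaves like full pooling; when \(\Sigma\) is large, it behaves like no pooling.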

Full pooling:

\[ S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,.0001) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\]

No pooling:

\[S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,\color{red}{1000}) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\]

Partial pooling:

\[S_i \sim normal(\mu_d, \sigma)\\ {\mu_d} \sim normal(\bar \mu,\color{red}{\Sigma}) \\ \sigma \sim exponential(1) \\ \bar \mu \sim normal(3,3) \\ \color{red}{\Sigma \sim halfnormal(0,2)}\]

Here is the partial pooling ulam model:

m.partial_pooling = alist(
  S ~ dnorm(mu, sigma),
  mu <- mu_bar + z[d] * Sigma,
  mu_bar ~ dnorm(3,3),
  sigma ~ dexp(1),
  z[d] ~ dnorm(0,1),
  Sigma ~ dhalfnorm(0,2),
  gq> vector[d]:mus<<-mu_bar+z*Sigma
)

Now we estimate the model …

fit.partial_pooling = ulam(
  m.partial_pooling,
  data = u.data,
  log_lik = TRUE, iter = 1000,
  chains = 4
)
precis(fit.partial_pooling,depth = 2) %>% 
  round(2)
##          mean   sd  5.5% 94.5%   n_eff Rhat4
## mu_bar   3.04 0.09  2.89  3.20  623.24  1.00
## sigma    1.02 0.03  0.98  1.06 2249.05  1.00
## z[1]     0.45 0.78 -0.79  1.71 2040.26  1.00
## z[2]    -0.53 0.76 -1.72  0.67 2311.23  1.00
## z[3]    -0.39 0.81 -1.68  0.92 2419.21  1.00
## z[4]     0.71 0.73 -0.44  1.86 2026.16  1.00
## z[5]     0.43 0.75 -0.75  1.65 2557.04  1.00
## z[6]    -0.20 0.62 -1.21  0.76 1418.61  1.00
## z[7]     0.22 0.59 -0.69  1.20 1661.63  1.00
## z[8]     0.67 0.61 -0.30  1.70 1272.08  1.00
## z[9]    -0.73 0.62 -1.73  0.24 1464.15  1.00
## z[10]    0.26 0.60 -0.70  1.20 1716.11  1.00
## z[11]    0.98 0.48  0.28  1.77  986.03  1.00
## z[12]    0.36 0.44 -0.31  1.06 1061.16  1.00
## z[13]   -0.70 0.43 -1.38 -0.03  938.84  1.00
## z[14]   -1.92 0.57 -2.88 -1.06  778.76  1.00
## z[15]    0.30 0.43 -0.37  1.00 1095.08  1.00
## Sigma    0.31 0.09  0.20  0.47  616.42  1.01
## mus[1]   3.19 0.25  2.81  3.60 2239.54  1.00
## mus[2]   2.88 0.23  2.50  3.24 2537.76  1.00
## mus[3]   2.92 0.25  2.51  3.31 2327.49  1.00
## mus[4]   3.26 0.23  2.91  3.66 1749.68  1.00
## mus[5]   3.17 0.23  2.81  3.57 2768.52  1.00
## mus[6]   2.98 0.18  2.70  3.27 2724.51  1.00
## mus[7]   3.11 0.16  2.85  3.37 3126.59  1.00
## mus[8]   3.24 0.17  2.98  3.52 3420.75  1.00
## mus[9]   2.82 0.17  2.53  3.09 2468.87  1.00
## mus[10]  3.12 0.17  2.84  3.39 2723.14  1.00
## mus[11]  3.33 0.10  3.18  3.49 2410.32  1.00
## mus[12]  3.15 0.10  2.99  3.29 2975.65  1.00
## mus[13]  2.83 0.09  2.68  2.98 3109.48  1.00
## mus[14]  2.47 0.10  2.31  2.63 2101.18  1.00
## mus[15]  3.13 0.10  2.97  3.28 3610.49  1.00

Let’s see if the model detected some variability in department level satisfaction:

post.Sigma = extract.samples(fit.partial_pooling)$Sigma
hist(post.Sigma, xlim = c(0, max(post.Sigma)))
abline(v = sd_firm_satisfaction, lty = 2, lwd = 2, col = "green3")

The vertical green line is the true standard deviation, which shows that the model accurately estimates the variability of departments.

Models in which we estimate the between-group standard deviation are called partial pooling models, because they lie somewhere on the continuum between no and full pooling.

Now let’s look at the estimated means:

x = plot_data(return.x = TRUE)
est.partial_pooling = precis(fit.partial_pooling, pars = "mus", depth = 2)
points(x,est.partial_pooling$mean, col = "blue", pch = 16)

Accuracy

To see which method provides the best estimates of department level satisfaction, we compare each with the true values:

abs.delta = rbind(
  no_pooling = est.no_pooling$mean - department_mus,
  full_pooling = est.full_pooling$mean - department_mus,
  partial_pooling = est.partial_pooling$mean - department_mus
) 

Here is a plot of these deviations:

abs.delta = abs.delta %>% abs() %>% t()
matplot(abs.delta, pch = 16, col = 2:4)
legend("topleft", bty = "n",
       title = "Pooling",
       col = 2:4,
       pch = 16,
       legend = gsub("_pooling","",colnames(abs.delta)))

It seems as if the partial pooling model is doing best, but it is not 100% clear.

A standard metric to compare the performance of different models is the root mean square deviation (RMSD), which we calculate now:

abs.delta^2 %>%    # square deviation
  colMeans() %>%   # mean
  sqrt()           # root
##      no_pooling    full_pooling partial_pooling 
##       0.2415329       0.2468614       0.1635485

Indeed, the partial pooling model has the smallest error. This is due to the same property of shrinkage estimators we already observed earlier in the course: shrinkage estimators do not fit the data as well as other models (here the no pooling model), but they are better at out-of-sample prediction.

Note, though, that this is not a big surprise, because this is exactly how we specified the data generating process.

Here is a function that produces the same plots we just saw and takes the between-department standard deviation as an input. This allows us to test whether the partial pooling model does well even if a full or no pooling model is the true DGP.

check.accuracy = function(sd_firm_satisfaction = .25, n_departments = 15) {
  fn = paste0("accuracy_",100*sd_firm_satisfaction,"_",n_departments,".Rdata")
  
  n_employees = 
    rep(c(5,22,98), each = n_departments/3) + 
    rbinom(n_departments,n_departments/3,.5)
  
  if (file.exists(fn)) {
    load(fn)
  } else {
    # simulate data
    department_mus = 
      rnorm(n_departments,
            mean_firm_satisfaction,
            sd_firm_satisfaction)
    firm_satisfaction = c()
    for (d in 1:n_departments) {
      employee_satisfaction = 
        rnorm(n_employees[d],
              department_mus[d],
              1)
      firm_satisfaction = rbind(
        firm_satisfaction,
        data.frame(
          department = as.integer(d),
          satisfaction = employee_satisfaction
        )
      )
    }
    
    # prepare ulam data
    u.data = list(
      N = nrow(firm_satisfaction),
      S = firm_satisfaction$satisfaction,
      d = as.integer(firm_satisfaction$department))
    
    # fit models
    fit.full_pooling = ulam(m.full_pooling.b, data = u.data, chains = 4, cores = n.cores)
    fit.no_pooling = ulam(m.no_pooling, data = u.data, chains = 4, cores = n.cores)
    fit.partial_pooling = ulam(m.partial_pooling, data = u.data, chains = 4, cores = n.cores)
    
    save(fit.full_pooling,
         fit.no_pooling,
         fit.partial_pooling,
         department_mus, u.data,
         firm_satisfaction,
         file = fn)
  }
  
 # get estimates
  est.full_pooling = precis(fit.full_pooling, pars = "mus", depth = 2)
  est.no_pooling = precis(fit.no_pooling, pars = "mus", depth = 2)
  est.partial_pooling = precis(fit.partial_pooling, pars = "mus", depth = 2)
  
  abs.delta = rbind(
    no_pooling = est.no_pooling$mean - department_mus,
    full_pooling = est.full_pooling$mean - department_mus,
    partial_pooling = est.partial_pooling$mean - department_mus
  ) %>% abs() %>% t()
  
  layout(matrix(c(1,1,2,3),ncol = 2))
  RMSD = abs.delta^2 %>% colMeans() %>% sqrt() 
  matplot(abs.delta, pch = 16, col = 2:4, ylab = "absolute deviation",
          main = paste0("sd = ", sd_firm_satisfaction, ", ", n_departments, " groups"))
  legend("topleft", bty = "n",
         title = "Pooling",
         col = 2:4,
         pch = 16,
         legend = gsub("_pooling","",colnames(abs.delta)))
  
  barplot(RMSD, col = 2:4)
  
  post.Sigma = extract.samples(fit.partial_pooling)$Sigma
  hist(post.Sigma, xlim = c(0, max(post.Sigma)), probability = T)
  # prior density of Sigma ~ halfnormal(0,2): twice the normal density on x >= 0
  curve(2*dnorm(x,0,2), add = T, col = "blue")
  abline(v = 0, lty = 2, col = "red")
  abline(v = sd_firm_satisfaction, lty = 2, lwd = 2, col = "green3")
}

Here is the performance of the three models if there is no variation between groups/departments.

set_par(cex = 1, mar=c(3,3,3,.5))
check.accuracy(sd_firm_satisfaction = .01, n_departments = 15)

Here is the performance of the three models if there is some variation between groups/departments.

set_par(cex = 1, mar=c(3,3,3,.5))
check.accuracy(sd_firm_satisfaction = 1, n_departments = 15)

Here is the performance of the three models if there is a lot of variation between groups/departments.

set_par(cex = 1, mar=c(3,3,3,.5))
check.accuracy(sd_firm_satisfaction = 10, n_departments = 15)

The partial pooling model is generally nearly as accurate as the “true” model. Obviously, this also depends on how much data is available! The figures above are based on 15 groups, but if the number of groups becomes small and each group has only a few members, a partial pooling model will have difficulties estimating the standard deviation of the between-group variation.

Bias–variance tradeoff

One issue with multilevel models (with partial pooling) is that their group-level estimates are not “unbiased”: they systematically deviate from the sample means, because they are shrunk towards the grand mean.

Let’s look at the means from the no pooling and partial pooling models, together with the sample means:

sample_means = 
  describeBy(firm_satisfaction$satisfaction, 
             group = firm_satisfaction$department, 
             mat = TRUE)[,"mean"]
set_par()
plot(1:15, sample_means, ylab = "mean", xlab = "department", cex = 1.5)
points(1:15, est.no_pooling$mean, pch = 16, col = 2)
points(1:15, est.partial_pooling$mean, pch = 16, col = 4)
abline(h = mean(u.data$S), lty = 3)
arrows(x0 = 1:15,
       y0 = est.no_pooling$mean,
       y1 = est.partial_pooling$mean,
       col = 4, lty = 3, length = .1)
legend("topright",
       pch = c(1,16,16),
       col = c(1,2,4), bty = "n",
       legend = c("dep.mean","no pooling", "partial pooling"))

The deviation between the sample mean and the partial pooling estimate shows that the latter is a biased estimate of the sample mean. This bias is larger for the smaller departments (on the left-hand side), which are shrunk more, and it is also larger for departments with extreme means; compare e.g. departments 13 and 14.

Now one could be concerned that partial pooling makes it difficult to detect differences between groups. But the bias just described is only part of the story. Next, we plot the estimates together with their credible intervals:

set_par()
ylim = range(c(est.no_pooling[,3:4],est.partial_pooling[, 3:4]))
plot(1:15, sample_means, ylab = "mean", xlab = "department", cex = 1.5, ylim = ylim)
points((1:15)-.1, est.no_pooling$mean, pch = 16, col = 2)
points((1:15)+.1, est.partial_pooling$mean, pch = 16, col = 4)
abline(h = mean(u.data$S), lty = 3)
arrows(x0 = (1:15)-.1,
       y0 = est.no_pooling$`5.5%`,
       y1 = est.no_pooling$`94.5%`,
       col = 2, length = 0)
arrows(x0 = (1:15)+.1,
       y0 = est.partial_pooling$`5.5%`,
       y1 = est.partial_pooling$`94.5%`,
       col = 4, length = 0)
legend("topright",
       pch = c(1,16,16),
       col = c(1,2,4), bty = "n",
       legend = c("dep.mean","no pooling", "partial pooling"))

What do you see?

The variance of the estimates from the hierarchical partial pooling model is generally smaller. The intuition here is that because the estimate for one department also relies partially on data from the other departments, the estimated mean for a department is effectively derived from more data and is therefore less uncertain.

This reduction in variance means that even though all estimates were shrunk towards the global mean and group differences became smaller, we can still detect group differences well, due to the reduced variance of the estimates of the group means.

More generally, the total error of an estimate can be decomposed into its variance and its squared bias:

VB = function(truth, estimate) {
  return(c(
    # variance: spread of the estimates around their own mean
    variance = mean((estimate-mean(estimate))^2),
    # squared bias: deviation of the mean estimate from the truth
    bias = mean((truth-mean(estimate))^2)
    ))
}

Shrinkage methods, and in particular multilevel regression with partial pooling, make a different trade-off than the more traditional no-pooling estimate by exchanging some bias for a reduction in variance.
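A small simulation (independent of the models above) makes this trade-off concrete: shrinking a noisy group mean halfway towards the grand mean adds bias but reduces variance, and here lowers the total error. The amount of shrinkage (0.5) and the group parameters are illustrative choices.

```r
set.seed(7)
truth = 3.5       # true group mean
grand_mean = 3    # mean we shrink towards
n = 10            # group size

# sampling distribution of the raw mean and a shrunk version
est_raw = replicate(5000, mean(rnorm(n, truth, 1)))
est_shrunk = .5 * est_raw + .5 * grand_mean

err = function(est) c(variance = var(est),
                      bias2 = (mean(est) - truth)^2,
                      total = mean((est - truth)^2))
round(rbind(raw = err(est_raw), shrunk = err(est_shrunk)), 3)
```

With these numbers, the shrunk estimator trades a squared bias of roughly 0.06 for a variance reduction from roughly 0.10 to 0.025, so its total error is lower than that of the raw mean.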

The following figure shows how partial pooling trades off variance for bias:

VB.partial_pooling = 
  do.call(rbind,lapply(1:15, function(d) {
    VB(sample_means[d],extract_post_ulam(fit.partial_pooling,pars = "mus")[[1]][,d])
  }))
VB.no_pooling = 
  do.call(rbind,lapply(1:15, function(d) {
    VB(sample_means[d],extract_post_ulam(fit.no_pooling,pars = "mus")[[1]][,d])
  }))

set_par()
xlim = range(c(VB.partial_pooling[,1],VB.no_pooling[,1]))
ylim = range(c(VB.partial_pooling[,2],VB.no_pooling[,2]))
plot(0, type = "n", col = 4,
     ylim = ylim, xlim = xlim,
     ylab = "Bias", xlab = "Variance")
arrows(x0 = VB.no_pooling[,1],
       y0 = VB.no_pooling[,2],
       x1 = VB.partial_pooling[,1],
       y1 = VB.partial_pooling[,2],
       length = .1, col = "grey")
text(VB.partial_pooling[,1], VB.partial_pooling[,2], 1:15, col = 4)
text(VB.no_pooling[,1], VB.no_pooling[,2], 1:15, col = 2)

legend("topright", bty = "n",
         col = c(2,4), pch = 16,
         legend = c("no pooling","partial pooling"))

Summary

Exercises

E2

Rewrite the following model as a multilevel model

No pooling:

\[y_i \sim Binomial(1,p_i) \\ logit(p_i) = \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(0,1.5)} \\ \beta \sim Normal(0,0.5)\]

Partial pooling:

\[ y_i \sim Binomial(1,p_i) \\ logit(p_i) = \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(\bar \alpha, \sigma_\alpha)} \\ \beta \sim Normal(0,0.5) \\ \color{red}{\bar \alpha \sim Normal(0,1.5)} \\ \color{red}{\sigma_\alpha \sim Exponential(1)} \]

E3

Rewrite the following model as a multilevel model

No pooling:

\[y_i \sim Normal(\mu_i,\sigma) \\ \mu_i = \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(0,5)} \\ \beta \sim Normal(0,1) \\ \sigma \sim Exponential(1) \]

Partial pooling:

\[y_i \sim Normal(\mu_i,\sigma) \\ \mu_i = \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(\bar \alpha, \sigma_\alpha)} \\ \beta \sim Normal(0,1) \\ \sigma \sim Exponential(1) \\ \color{red}{\bar \alpha \sim Normal(0,5)} \\ \color{red}{\sigma_\alpha \sim Exponential(1)} \]

M3

The Normal and the Cauchy distributions:

set_par(mfrow = c(2,1))
curve(dnorm(x),-7,7,ylab = "density") 
curve(dcauchy(x), col = "blue", add = T, n = 500)
legend("topleft",lty = 1,bty = "n",
       col = c("black","blue"),
       legend = c("Normal","Cauchy"))
curve(dcauchy(x)/dnorm(x), -7,7, col = "red", n = 500, log = "y")

M4

The Student-t and the Cauchy distributions:

set_par()
curve(dnorm(x),-7,7,ylab = "density", col = "grey")
curve(dstudent(x,nu = 2), add = T, col = "green3", n = 500) 
#curve(dstudent(x,nu = 10), add = T, col = "green3",n = 500, lty = 3) 
curve(dcauchy(x), col = "blue", add = T, n = 500)
legend("topleft",lty = c(1,1,1),bty = "n",
       col = c("green3","blue","gray"),
       legend = c("Student-t, df = 2","Cauchy", "Normal"))

Non-centered parameterization to solve divergent transitions

Centered:

alist(
  S ~ dbinom(N,p),
  logit(p) <- a[tank],
  a[tank] ~ dstudent(2,a_bar,sigma),
  a_bar ~ dnorm(0,1.5),
  sigma ~ dexp(1)
)

Non-centered:

alist(
  S ~ dbinom(N,p),
  logit(p) <- a[tank],
  a <- a_bar + z*sigma,
  a_bar ~ dnorm(0,1.5),
  sigma ~ dexp(1),
  z[tank] ~ dstudent(2,0,1)
)


M6

As we have seen earlier, extreme values are more probable under fat-tailed distributions such as the Student-t and the Cauchy.

If we then have the data \(y = 0\) and use the model

\[ y \sim Normal(\mu,1) \\ \mu \sim Normal(10,1) \]

  • the Normal prior will make low \(\mu\) values unlikely
  • the Normal likelihood will make the data unlikely given large \(\mu\),
  • so that \(\mu\) will be estimated to lie between the value of \(y\) and the mean of the prior for \(\mu\) (mean of \(\mu\) ≈ 5)

If we use the model

\[ y \sim Normal(\mu,1) \\ \mu \sim Student(2,10,1) \]

  • the Student prior will make low \(\mu\) values relatively more likely
  • the Normal likelihood will make the data unlikely given large \(\mu\),
  • so that \(\mu\) will be estimated to be closer to the value of \(y\) (mean of \(\mu\) ≈ 0.3)

If we use the model

\[ y \sim Student(2,\mu,1) \\ \mu \sim Normal(10,1) \]

  • the Normal prior will make low \(\mu\) values unlikely
  • the Student likelihood will make data that deviate from \(\mu\) relatively more likely
  • so that \(\mu\) will be estimated to be close to the mean of the prior for \(\mu\) (mean of \(\mu\) ≈ 9.7)

If we use the model

\[ y \sim Student(2,\mu,1) \\ \mu \sim Student(2,10,1) \]

  • the Student prior will make low \(\mu\) values relatively more likely
  • the Student likelihood will make data that deviate from \(\mu\) relatively more likely
  • so that \(\mu\) will be estimated to lie between the mean of the prior for \(\mu\) and the value of \(y\) (mean of \(\mu\) ≈ 6.2)